
fix(state): pending sessions — stop holding the exclusive lock through provisioning #20

Merged: hefgi merged 1 commit into main from claude/magical-einstein-7m4g40-lock on Jun 11, 2026
hefgi (Owner) commented Jun 11, 2026

Fixes #1

Problem

cmd_up acquired the exclusive state lock and held it through the entire bring_up — docker pulls, compose up, worktree creation, and user-defined hooks (no timeout) — and lock acquisition times out after 10s. Consequences for the core use case (N agents in parallel):

  • two simultaneous ecluse ups serialized; the second failed with LockTimeout whenever the first took >10s (first-time image pull, migration hook)
  • ls / env / status / shell (shared lock) failed while any up was in flight
  • down/shutdown showed interactive prompts while holding the exclusive lock — an unanswered prompt blocked every ecluse command in the repo

Design

Sessions get a status: active | pending field (#[serde(default)] + skipped when active, so existing state.json files round-trip byte-identical).

up (new session): short exclusive section → allocate slot → commit a pending session (reserves slug + slot against concurrent ups) → release lock → provision → re-acquire → swap in the real session, or drop the reservation on failure (bring_up already rolled its resources back, see #2/#19).
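The reserve, release, provision, finalize sequence above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: a `std::sync::Mutex` stands in for the exclusive state-file lock, and `State`, `Session`, `free_slot`, and this `up` signature are hypothetical names chosen for the example.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Active,
    Pending,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Session {
    pub slot: u16,
    pub status: Status,
}

/// Hypothetical stand-in for state.json guarded by the exclusive lock.
pub type State = Mutex<BTreeMap<String, Session>>;

/// Lowest slot not held by any session; pending sessions count as holders.
fn free_slot(sessions: &BTreeMap<String, Session>) -> u16 {
    (0u16..)
        .find(|candidate| sessions.values().all(|s| s.slot != *candidate))
        .expect("slot space exhausted")
}

/// Reserve under a short lock, provision without it, then finalize.
pub fn up<F>(state: &State, slug: &str, provision: F) -> Result<u16, String>
where
    F: FnOnce(u16) -> Result<(), String>,
{
    // Short exclusive section: commit a pending session, then release.
    let slot = {
        let mut sessions = state.lock().unwrap();
        if sessions.contains_key(slug) {
            return Err(format!("{}: session exists or operation in progress", slug));
        }
        let slot = free_slot(&sessions);
        sessions.insert(
            slug.to_string(),
            Session { slot, status: Status::Pending },
        );
        slot
    }; // the guard is dropped here, so other commands can take the lock

    // Slow work (pulls, compose up, hooks) runs without the lock.
    let outcome = provision(slot);

    // Re-acquire only to swap in the real session or drop the reservation.
    let mut sessions = state.lock().unwrap();
    match outcome {
        Ok(()) => {
            sessions.insert(
                slug.to_string(),
                Session { slot, status: Status::Active },
            );
            Ok(slot)
        }
        Err(e) => {
            sessions.remove(slug);
            Err(e)
        }
    }
}
```

Because the guard is dropped before `provision` runs, a reader that takes the lock during provisioning sees the session as pending instead of waiting for the lock.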

up (resume): same pattern — mark pending, release, health-check + start services, re-acquire, replace or restore.

down: resolve target from a shared-lock snapshot, mark pending in a short exclusive section, prompt and tear down without the lock, then remove (or restore as active if teardown failed).
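The remove-or-restore step for down could look something like the sketch below. It is an assumption-laden illustration: the `Mutex` again stands in for the state lock, prompting is omitted, and `State` and this `down` signature are hypothetical rather than taken from the PR.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Active,
    Pending,
}

/// Hypothetical stand-in for the locked state: slug -> status.
pub type State = Mutex<BTreeMap<String, Status>>;

/// Mark pending under a short lock, tear down without it, then
/// remove the session on success or restore it as active on failure.
pub fn down<F>(state: &State, slug: &str, teardown: F) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String>,
{
    {
        let mut sessions = state.lock().unwrap();
        let status = sessions
            .get_mut(slug)
            .ok_or_else(|| format!("{}: no such session", slug))?;
        // Works on pending sessions too, which is the crash-recovery path.
        *status = Status::Pending;
    } // lock released before any prompt or teardown work

    let outcome = teardown();

    let mut sessions = state.lock().unwrap();
    match outcome {
        Ok(()) => {
            sessions.remove(slug);
            Ok(())
        }
        Err(e) => {
            sessions.insert(slug.to_string(), Status::Active);
            Err(e)
        }
    }
}
```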

shutdown: per-session prompts and teardown all run outside the lock, with re-verification under the lock before each marking.

Pending sessions:

  • still reserve their slot (used_slots unchanged)
  • show as slug (pending) in ls (and "status": "pending" in --json)
  • up/env/status/shell against them error actionably ("operation in progress… run ecluse down <slug> if it crashed")
  • down works on them — that's the recovery path for an up that crashed between reserve and finalize
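As a rough sketch of these rules, assuming hypothetical helper names (`used_slots` appears in the description above, but `ls_label`, `require_active`, and the exact error wording are invented for illustration):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Active,
    Pending,
}

#[derive(Debug)]
pub struct Session {
    pub slug: String,
    pub slot: u16,
    pub status: Status,
}

/// Every session holds its slot, whether active or pending.
pub fn used_slots(sessions: &[Session]) -> Vec<u16> {
    sessions.iter().map(|s| s.slot).collect()
}

/// Label shown by `ls`: pending sessions get a "(pending)" marker.
pub fn ls_label(session: &Session) -> String {
    match session.status {
        Status::Active => session.slug.clone(),
        Status::Pending => format!("{} (pending)", session.slug),
    }
}

/// Guard for up/env/status/shell: refuse pending sessions and point
/// at the recovery command. The message text here is an assumption.
pub fn require_active(session: &Session) -> Result<&Session, String> {
    match session.status {
        Status::Active => Ok(session),
        Status::Pending => Err(format!(
            "{0}: operation in progress; run `ecluse down {0}` if it crashed",
            session.slug
        )),
    }
}
```

Keeping the check in one guard means every read command refuses pending sessions in the same way, while `down` simply skips the guard.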

Tests

  • state: status serde round-trip, old-state-file default to active, active-not-serialized (byte compat), pending-reserves-slot
  • integration (real binary, slow post_up = "sleep 3" hook):
    • ls_works_while_up_is_provisioning: ls returns promptly with the (pending) marker while up sleeps; this timed out after 10s before the fix
    • up_on_pending_session_errors_actionably: concurrent up on the same slug can't race the reservation
    • failed_up_removes_pending_reservation: failed provisioning frees the slot and leaves ls empty

cargo fmt --check, cargo clippy -- -D warnings, cargo test (373 + 21) green.

Note: this updates the AGENTS.md-stated invariant "the lock is held for the entire duration of up and down" in spirit — the docs file itself is fixed in the docs-drift PR for #15. cmd_sync still does its (fast, discovery-only) work under the lock; unchanged here.

https://claude.ai/code/session_017UcuvzMKHVfyBCcq8ipAko


Generated by Claude Code

cmd_up held the exclusive state lock across the entire bring_up — docker
pulls, compose up, worktree creation, and user hooks with no timeout —
while lock acquisition gives up after 10s. Parallel `up`s serialized and
failed with LockTimeout, and every read command (ls/env/status/shell)
failed while any up was in flight. down and shutdown additionally showed
interactive prompts while holding the lock.

Sessions now carry a status (active | pending, serde-compatible with old
state files). up reserves the slug + slot by committing a pending session
in a short exclusive section, releases the lock, provisions, then
re-acquires to finalize (or drop the reservation on failure — bring_up
already rolled back the resources). Resume and down mark the session
pending the same way and restore it on failure; shutdown resolves prompts
and runs teardown outside the lock per session.

Pending sessions: reserve their slot, show as '(pending)' in ls, are
refused by up/env/status/shell with an actionable error, and can be
cleaned with `ecluse down <slug>` if the owning operation crashed.

Fixes #1

https://claude.ai/code/session_017UcuvzMKHVfyBCcq8ipAko
@hefgi hefgi merged commit e692aac into main Jun 11, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

Exclusive state lock is held through all of provisioning — parallel up serializes and everything else hits LockTimeout
